sequence classification
Efficient Approximation Algorithms for Strings Kernel Based Sequence Classification
Sequence classification algorithms, such as SVM, require a definition of distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (k-length subsequences) in the two sequences. Extending this definition, by considering two k-mers to match if their distance is at most m, yields better classification performance. This, however, makes the problem computationally much more complex. Known algorithms to compute this similarity have computational complexity that render them applicable only for small values of k and m.
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > New Jersey (0.04)
- (2 more...)
XAI-Driven Deep Learning for Protein Sequence Functional Group Classification
Chakraborty, Pratik, Bhargava, Aryan
Proteins perform essential biological functions, and accurate classification of their sequences is critical for understanding structure-function relationships, enzyme mechanisms, and molecular interactions. This study presents a deep learning-based framework for functional group classification of protein sequences derived from the Protein Data Bank (PDB). Four architectures were implemented: Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), CNN-BiLSTM hybrid, and CNN with Attention. Each model was trained using k-mer integer encoding to capture both local and long-range dependencies. Among these, the CNN achieved the highest validation accuracy of 91.8%, demonstrating the effectiveness of localized motif detection. Explainable AI techniques, including Grad-CAM and Integrated Gradients, were applied to interpret model predictions and identify biologically meaningful sequence motifs. The discovered motifs, enriched in histidine, aspartate, glutamate, and lysine, represent amino acid residues commonly found in catalytic and metal-binding regions of transferase enzymes. These findings highlight that deep learning models can uncover functionally relevant biochemical signatures, bridging the gap between predictive accuracy and biological interpretability in protein sequence analysis.
Metadata-Aligned 3D MRI Representations for Contrast Understanding and Quality Control
Avci, Mehmet Yigit, Borges, Pedro, Fernandez, Virginia, Wright, Paul, Yigitsoy, Mehmet, Ourselin, Sebastien, Cardoso, Jorge
Magnetic Resonance Imaging suffers from substantial data heterogeneity and the absence of standardized contrast labels across scanners, protocols, and institutions, which severely limits large-scale automated analysis. A unified representation of MRI contrast would enable a wide range of downstream utilities, from automatic sequence recognition to harmonization and quality control, without relying on manual annotations. To this end, we introduce MR-CLIP, a metadata-guided framework that learns MRI contrast representations by aligning volumetric images with their DICOM acquisition parameters. The resulting embeddings shows distinct clusters of MRI sequences and outperform supervised 3D baselines under data scarcity in few-shot sequence classification. Moreover, MR-CLIP enables unsupervised data quality control by identifying corrupted or inconsistent metadata through image-metadata embedding distances. By transforming routinely available acquisition metadata into a supervisory signal, MR-CLIP provides a scalable foundation for label-efficient MRI analysis across diverse clinical datasets.
- North America > United States > Virginia (0.40)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
A Fairness metric
Dynabench comprises four dynamic tasks with multiple rounds of datasets that will grow over time. These names cover 85.6 percent of the U.S. population, are based on the 1990 Census information on first name frequencies [ Note: there is nothing inherently "racial" about particular names--for example, each demographic group had at least a few people named "Anna" or "Benjamin" [ For gender identity, we investigate two kinds of perturbations: names and noun phrases. Social Security Association's Lists of Baby Names (1980-2019), and performed perturbations as For noun phrases (i.e., pronouns and nouns), we adopted a slightly more "her" with a noun like "dad" and expect no effect on the classification label). Given that our perturbations are heuristic, some noise is to be expected. Consider "I've always enjoyed eating at Red Robin " being perturbed to "I've always enjoyed eating at Red Kayla ".
A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge
Lorello, Luca Salvatore, Lippi, Marco, Melacci, Stefano
One of the goals of neuro-symbolic artificial intelligence is to exploit background knowledge to improve the performance of learning tasks. However, most of the existing frameworks focus on the simplified scenario where knowledge does not change over time and does not cover the temporal dimension. In this work we consider the much more challenging problem of knowledge-driven sequence classification where different portions of knowledge must be employed at different timesteps, and temporal relations are available. Our experimental evaluation compares multi-stage neuro-symbolic and neural-only architectures, and it is conducted on a newly-introduced benchmarking framework. Results demonstrate the challenging nature of this novel setting, and also highlight under-explored shortcomings of neuro-symbolic methods, representing a precious reference for future research.
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- North America > Canada > Nova Scotia > Halifax Regional Municipality > Halifax (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
- Asia > Indonesia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.92)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)
Can Large Language Models Predict Antimicrobial Resistance Gene?
This study demonstrates that generative large language models can be utilized in a more flexible manner for DNA sequence analysis and classification tasks compared to traditional transformer encoder-based models. While recent encoder-based models such as DNABERT and Nucleotide Transformer have shown significant performance in DNA sequence classification, transformer decoder-based generative models have not yet been extensively explored in this field. This study evaluates how effectively generative Large Language Models handle DNA sequences with various labels and analyzes performance changes when additional textual information is provided. Experiments were conducted on antimicrobial resistance genes, and the results show that generative Large Language Models can offer comparable or potentially better predictions, demonstrating flexibility and accuracy when incorporating both sequence and textual information. The code and data used in this work are available at the following GitHub repository: https://github.com/biocomgit/llm4dna.
Neuromorphic Spiking Neural Network Based Classification of COVID-19 Spike Sequences
Murad, Taslim, Chourasia, Prakash, Ali, Sarwan, Khan, Imdad Ullah, Patterson, Murray
The availability of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) virus data post-COVID has reached exponentially to an enormous magnitude, opening research doors to analyze its behavior. Various studies are conducted by researchers to gain a deeper understanding of the virus, like genomic surveillance, etc, so that efficient prevention mechanisms can be developed. However, the unstable nature of the virus (rapid mutations, multiple hosts, etc) creates challenges in designing analytical systems for it. Therefore, we propose a neural network-based (NN) mechanism to perform an efficient analysis of the SARS-CoV-2 data, as NN portrays generalized behavior upon training. Moreover, rather than using the full-length genome of the virus, we apply our method to its spike region, as this region is known to have predominant mutations and is used to attach to the host cell membrane. In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed-length numerical representation and then uses Neuromorphic Spiking Neural Network to classify those sequences. We compare the performance of our method with various baselines using real-world SARS-CoV-2 spike sequence data and show that our method is able to achieve higher predictive accuracy compared to the recent baselines.
- Asia > Pakistan > Punjab > Lahore Division > Lahore (0.04)
- North America > United States (0.04)
EPIC: Enhancing Privacy through Iterative Collaboration
Chourasia, Prakash, Lonkar, Heramb, Ali, Sarwan, Patterson, Murray
Advancements in genomics technology lead to a rising volume of viral (e.g., SARS-CoV-2) sequence data, resulting in increased usage of machine learning (ML) in bioinformatics. Traditional ML techniques require centralized data collection and processing, posing challenges in realistic healthcare scenarios. Additionally, privacy, ownership, and stringent regulation issues exist when pooling medical data into centralized storage to train a powerful deep learning (DL) model. The Federated learning (FL) approach overcomes such issues by setting up a central aggregator server and a shared global model. It also facilitates data privacy by extracting knowledge while keeping the actual data private. This work proposes a cutting-edge Privacy enhancement through Iterative Collaboration (EPIC) architecture. The network is divided and distributed between local and centralized servers. We demonstrate the EPIC approach to resolve a supervised classification problem to estimate SARS-CoV-2 genomic sequence data lineage without explicitly transferring raw sequence data. We aim to create a universal decentralized optimization framework that allows various data holders to work together and converge to a single predictive model. The findings demonstrate that privacy-preserving strategies can be successfully used with aggregation approaches without materially altering the degree of learning convergence. Finally, we highlight a few potential issues and prospects for study in FL-based approaches to healthcare applications.
- North America > United States > California (0.14)
- Europe > United Kingdom > Scotland (0.05)
- Europe > Sweden (0.05)
- (10 more...)